This document explains the methodology of the correlations analysis for San Jose social distancing compliance and summarizes key results. It uses data on social distancing through 5/10/2020.

library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(tigris)
library(censusapi)
library(leaflet)
library(lehdr)
library(usmap)


options(
  tigris_class = "sf",
  tigris_use_cache = TRUE
)

Key insights

Methodology

The data used for social distancing compliance comes from Safegraph’s social distancing dataset. This dataset tracks individuals’ cell phone GPS location. We have access to the aggregate behavior of movement that is at the census block group scale. In this analysis, we use the “completely at home” field, which Safegraph defines as a count of devices that did not leave their usual nighttime location (see documentation at https://docs.safegraph.com/docs/social-distancing-metrics). For each census block group in San Jose, we calculate the average percent of devices completely at home on weekdays since the start of the Bay Area shelter-in-place order (3/16/2020), as well as the percent of devices completely at home on weekdays during the months of January and February 2020, prior to the shelter-in-place order and widespread COVID-19 concerns. From these results, we obtain the percent of devices leaving home during these time periods.

In our analysis, we quantify the correlations between percentage of devices leaving their respective homes before and after the shelter in-place-order was instated and census block group averages of various demographic variables, including income, age, language ability, race, ethnicity, education level, vehicle ownership, occupants per room in a household, sex of workers, and high speed internet access. We choose to analyze these demographic variables based on our literature review of existing research on correlations with social distancing and our hypotheses of potentially relevant factors in San Jose and the Bay Area. Information on these demographic variables at the census block group level is from the American Community Survey 2014-2018 5 year summary data. We also quantify the correlations between these demographic variables and the change in percent of devices staying completely at home after the shelter-in-place order relative to before the order. This latter metric would suggest the ability of a community to alter their behavior to comply with the shelter-in-place order.

Results

Income is a strong predictor of the percentage of devices leaving home

Income is a strong predictor of percentage of devices leaving home during the shelter-in-place order period. We consider different income thresholds, and conclude that the percentage of households in a block group earning over $125,000 annually is the best predictor. In the graph below, we plot the percentage of devices leaving home with the percentage of households making over $125,000, with each block group represented as a point on the graph. The best fit linear trendline is shown in orange. The slider on the bottom switches the data between percentage of devices leaving home before the shelter-in-place order to after the shelter-in-place order.

# load data
sj_dem_distancing_pre_post <- readRDS("/Users/simonespeizer/Documents/2020 Spring Quarter/CEE 218Z/covid19/Simone_Speizer/sj_socialdistancing_demdata_prepostdifs_manyvars.rds")

# combine the data so that plots can be animated with trendlines
# get the before shelter in place data
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home pre shelter`, blockgroup) 
sj_dem_distancing_pre_shelter[is.na(sj_dem_distancing_pre_shelter)] <- 0

#  relabel column
colnames(sj_dem_distancing_pre_shelter)[1] <- "% leaving home"

# add back demographic variables
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>% left_join(sj_dem_distancing_pre_post) 

# get trendlines
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>%
  mutate(
    income_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% over 125,000`)),
    hispanic_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% hispanic/latino`)),
    educ_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`percent associates or higher`))) %>%
  cbind(`Before or After Shelter-in-Place` = "Before shelter-in-place")

# repeat for post shelter in place
sj_dem_distancing_post_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home`, blockgroup) 

sj_dem_distancing_post_shelter[is.na(sj_dem_distancing_post_shelter)] <- 0

#  relabel column
colnames(sj_dem_distancing_post_shelter)[1] <- "% leaving home"

# add back demographic variables
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>% left_join(sj_dem_distancing_pre_post) 

# get trendlines
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>%
  mutate(
    income_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% over 125,000`)),
    hispanic_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% hispanic/latino`)),
    educ_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`percent associates or higher`))) %>%
  cbind(`Before or After Shelter-in-Place` = "After shelter-in-place")

# combine them
sj_dem_distancing_pre_post_separate <- rbind(sj_dem_distancing_pre_shelter, sj_dem_distancing_post_shelter)

# convert the before/after column to factor so it shows up correctly on the plots
sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place` <- factor(sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place`, levels = c("Before shelter-in-place", "After shelter-in-place"))


fig_income <- 
  plot_ly (sj_dem_distancing_pre_post_separate) %>%
    add_trace(
      x = ~`% over 125,000`, 
      y = ~`% leaving home`, 
      frame = ~`Before or After Shelter-in-Place`, 
      type = 'scatter', 
      mode = 'markers', 
      showlegend = F
    ) %>% 
    add_trace(
      x = ~`% over 125,000`,
      y = ~income_trendline,
      type = 'scatter',
      mode = 'lines',
      line = list(size = 5, color = 'rgba(255, 165, 0, 1)'),
      frame = ~`Before or After Shelter-in-Place`,
      showlegend = F
    ) %>% 
  animation_button(visible = F) %>%
  animation_slider(
    pad = list(t =75),
    currentvalue = list(visible=F)
  ) %>% 
  layout(xaxis = list(title = 'Percentage of households making over $125,000'), yaxis = list(title = 'Percentage of devices leaving home'), margin = list(l = 75,r = 75))

fig_income

From this figure, we see that during the shelter-in-place order period a higher percentage of households making over $125,000 in a block group correlates with fewer devices leaving the home in that block group. This trend is the opposite of that observed prior to the shelter-in-place order, suggesting that block groups with a greater percentage of households of higher income were more able to adjust their behavior to comport with the shelter-in-place order.

To better assess this relative change in behavior, we fit a linear model to the change in devices staying completely at home after the shelter-in-place order (relative to before the order) and the percentage of households earning more than $125,000. The results of that model, including the coefficient on income (the slope of the linear fit) and the R-squared value, are shown below.

Coefficient:

income_125_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(income_125_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)       13.3871    0.75357   17.76 1.840e-56
## `% over 125,000`   0.3068    0.01665   18.43 9.009e-60

R-squared:

print(summary.lm(income_125_model_dif)$r.squared, digits  = 4)
## [1] 0.3746

From the coefficient value, we see that as the percentage of households with incomes over $125,000 in a block group increases by 1%, the difference between the percentage of devices staying completely at home after the shelter-in-place order and the percentage completely at home before the order increases by about 0.31%. The R-squared value assesses the degree to which this model accurately predicts the variation in change in devices staying completely at home observed in the data; the result of 0.37 indicates that the linear fit with income predicts about 37% of the observed variation. The low p value indicates that these results are statistically significant. This is a relatively strong prediction, even without examining the effect of other demographic variables.

Education level is also a strong predictor, but is highly correlated with income

Education level–specifically percentage of individuals in a block group that have a degree at the Associate’s level or higher–also well predicts percentage of devices leaving home during the shelter-in-place order period.

fig_educ <- 
  plot_ly (sj_dem_distancing_pre_post_separate) %>%
    add_trace(
      x = ~`percent associates or higher`, 
      y = ~`% leaving home`, 
      frame = ~`Before or After Shelter-in-Place`, 
      type = 'scatter', 
      mode = 'markers', 
      showlegend = F
    ) %>% 
    add_trace(
      x = ~`percent associates or higher`,
      y = ~educ_trendline,
      type = 'scatter',
      mode = 'lines',
      line = list(size = 5, color = 'rgba(255, 165, 0, 1)'),
      frame = ~`Before or After Shelter-in-Place`,
      showlegend = F
    ) %>% 
  animation_button(visible = F) %>%
  animation_slider(
    pad = list(t =75),
    currentvalue = list(visible=F)
  ) %>% 
  layout(xaxis = list(title = 'Percentage of individuals with an Associate Degree or higher'), yaxis = list(title = 'Percentage of devices leaving home'), margin = list(l = 75,r = 75))

fig_educ

Similar to the correlation with income, during the shelter-in-place order period a higher percentage of individuals with degrees at the Associate’s level or higher in a block group correlates with fewer devices leaving home in that block group. This is the opposite of the trend present prior to the shelter-in-place order.

The results of the linear model fitting the change in percentage of devices staying completely at home and the percentage of individuals with degrees at the Associate’s level or higher are shown below.

Coefficient:

educ_model_dif <- lm(`% increase in staying completely home` ~ `percent associates or higher`, sj_dem_distancing_pre_post)
print(summary.lm(educ_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                      12.924    0.87475   14.77 5.029e-42
## `percent associates or higher`    0.277    0.01716   16.14 1.737e-48

R-squared:

print(summary.lm(educ_model_dif)$r.squared, digits  = 4)
## [1] 0.3147

As the percentage of individuals with degrees at the Associate’s level or higher in a block group increases by 1%, the change in percentage of devices staying completely at home increases by about 0.28%. This linear model with education predicts about 31% of the observed variation in change in percentage of devices staying completely at home. Again, the low p value indicates that the correlation is statistically significant.

As noted, this trend is very similar to that seen in the income data. We also expect income and education to be highly correlated. Thus, it is possible that education level may not provide much more predictive information than income already does. To assess this, we perform a multiple regression analysis on these data. In a multiple regression analysis, combining multiple variables into a single model will either suggest that all the variables included have some explanatory power, or will indicate that once the effect of one or more of the variables is accounted for, some of the other variables lose their predictive ability. The results of the multiple regression analysis for income and education with change in percentage of devices staying completely at home are shown below.

Coefficients:

educ_income_model_dif <- lm(`% increase in staying completely home` ~ `percent associates or higher` + `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(educ_income_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                     11.1274    0.83575   13.31 2.221e-35
## `percent associates or higher`   0.1272    0.02251    5.65 2.549e-08
## `% over 125,000`                 0.2158    0.02286    9.44 9.623e-20

R-squared:

print(summary.lm(educ_income_model_dif)$r.squared, digits  = 4)
## [1] 0.408

The model with both education and income predicts 41% of the variation in the change in percentage of devices staying completely at home. Note that this is an improvement of 4% over the model with just income, which predicted 37% of the variation; this suggests that education and income are indeed strongly correlated, but that education does offer some additional predictive ability beyond that provided by income. Note also that the coefficient on income is larger than the coefficient on education, suggesting that its effect is larger.

Hispanic/Latino population appears to be a strong predictor, but loses its predictive ability when combined with income, education, and Asian population

We next consider correlations between the racial and ethnic makeup of a block group and its social distancing compliance. We focus our analysis on Hispanic/Latino population, Asian population, and White population, as these are the main races/ethnicities in San Jose. Note that these categories are defined in accordance with the US Census descriptions and as such the race variables refer to an individual’s origins.

We find that percentage of residents of a block group that are White does not correlate with change in leaving home behavior of that block group (R-squared 0.005). Higher percentage of residents that are Asian has a weak positive correlation with fewer devices leaving home following the shelter-in-place order (coefficient 0.14, R-squared 0.1). In contrast, higher percentage of residents that are Hispanic/Latino in a block group correlates with more devices leaving home following the shelter-in-place order, as shown in the graph below.

fig_hisp <- 
  plot_ly (sj_dem_distancing_pre_post_separate) %>%
    add_trace(
      x = ~`% hispanic/latino`, 
      y = ~`% leaving home`, 
      frame = ~`Before or After Shelter-in-Place`, 
      type = 'scatter', 
      mode = 'markers', 
      showlegend = F
    ) %>% 
    add_trace(
      x = ~`% hispanic/latino`,
      y = ~hispanic_trendline,
      type = 'scatter',
      mode = 'lines',
      line = list(size = 5, color = 'rgba(255, 165, 0, 1)'),
      frame = ~`Before or After Shelter-in-Place`,
      showlegend = F
    ) %>% 
  animation_button(visible = F) %>%
  animation_slider(
    pad = list(t =75),
    currentvalue = list(visible=F)
  ) %>% 
  layout(xaxis = list(title = 'Percent of residents that are Hispanic or Latino', autorange = "reversed"), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))

fig_hisp

The results of the linear model fitting the change in percentage of devices staying completely at home and the percentage of residents that are Hispanic/Latino are shown below.

Coefficient:

hispanic_model_dif <- lm(`% increase in staying completely home` ~ `% hispanic/latino`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                     Estimate Std. Error t value   Pr(>|t|)
## (Intercept)          33.2654     0.5866   56.71 7.914e-236
## `% hispanic/latino`  -0.2278     0.0150  -15.18  6.276e-44

R-squared:

print(summary.lm(hispanic_model_dif)$r.squared, digits  = 4)
## [1] 0.289

As the percentage of individuals that are Hispanic/Latino in a block group increases by 1%, the change in percentage of devices staying completely at home decreases by about 0.23%, meaning that a higher percentage of Hispanic/Latino residents is associated with a smaller increase in staying completely at home. This linear model with the Hispanic/Latino population of a block group predicts about 29% of the observed variation in the change in devices staying completely at home, comparable with the linear fit for education that was previously shown. The results are again statistically significant.

However, we hypothesize that the correlation observed here might be related to underlying correlations between Hispanic/Latino population and other demographic variables. To test this, we perform a multiple regression analysis with Hispanic/Latino population, income, education, and Asian population, yielding the following results.

Coefficients:

hispanic_inc_educ_asian_model_dif <- lm(`% increase in staying completely home` ~ `% hispanic/latino` + `% over 125,000` + `percent associates or higher` + `% Asian`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_inc_educ_asian_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                    10.83791    2.33245   4.647 4.205e-06
## `% hispanic/latino`            -0.01484    0.02609  -0.569 5.696e-01
## `% over 125,000`                0.21030    0.02256   9.320 2.600e-19
## `percent associates or higher`  0.09562    0.03045   3.141 1.775e-03
## `% Asian`                       0.07632    0.01580   4.831 1.752e-06

R-squared:

print(summary.lm(hispanic_inc_educ_asian_model_dif)$r.squared, digits  = 4)
## [1] 0.439

When accounting for education, income, and the Asian population of a block group, the percentage of residents that are Hispanic/Latino loses its predictive ability–its p value is no longer statistically significant–though all three of the other variables are significant. Income appears to be the key variable, followed by education and Asian population. This combined model predicts 44% of the variation in the change in percentage of devices staying completely at home.

Income, education level, Asian population, child population, and young adult population together provide the greatest predictive ability

As we have shown, income, education level, and Asian population together are very strong predictors of the change in percentage of devices staying completely at home in a block group. These three variables, when combined with the child and young adult population of a block group, yield a model with the greatest predictive ability for the change in percentage of devices staying completely at home. Though the two age variables are not strong predictors on their own, when included in a multivariable model they do provide additional predictive power beyond that of a model with only income, education level, and Asian population.

Note that this best-predicting model was determined by comparing multiple regression analyses of different combinations of the census variables we include in our analysis, and offers the highest predictive ability that we find using these demographic variables; however, there could be other variables that we do not consider here that may add further explanatory power.

The parameters of the best-predicting model from our analysis are presented below.

Coefficients:

inc_educ_asian_child_yad_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000` + `percent associates or higher` + `% Asian` + `percent less than 18` + `percent 20-29`, sj_dem_distancing_pre_post)
print(summary.lm(inc_educ_asian_child_yad_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                     6.85199    1.91126   3.585 3.663e-04
## `% over 125,000`                0.16596    0.02248   7.382 5.652e-13
## `percent associates or higher`  0.13763    0.02204   6.246 8.312e-10
## `% Asian`                       0.08802    0.01385   6.357 4.243e-10
## `percent less than 18`          0.21799    0.04854   4.491 8.617e-06
## `percent 20-29`                -0.13899    0.04179  -3.326 9.376e-04

R-squared:

print(summary.lm(inc_educ_asian_child_yad_model_dif)$r.squared, digits  = 4)
## [1] 0.4861

All five of these variables are significant in this model. Higher income and educational attainment in a block group, higher percentage of residents that are Asian, and higher percentage of residents that are children are all associated with larger increases in percent of devices staying completely at home; higher percentage of residents that are ages 20-29 is associated with a smaller increase in percentage of devices staying completely at home following shelter-in-place. These five variables together predict about 48% of the variation in the change in percentage of devices staying completely at home.

Summarizing, income predicted about 37% of the change in percent of devices staying completely at home, and adding in education raised this prediction to 41%. Including percentage of residents that are Asian boosted the predictive power to 44%, and adding percentage of residents that are younger than 18 and between the ages of 20-29 raised it to 48%.

Other results

Variables with some correlation but that are less important in the overall model

We briefly summarize the results for other demographic variables that do have some correlation with changes in staying at home, but that are not found to be important in the highest-predicting multiple regression model. These variables include high speed internet access, occupants per room in a household, English language ability, and Spanish language ability.

The analysis on internet access, specifically the percentage of households that have access to high speed internet, was inspired by the paper “Social Distancing, Internet Access and Inequality” by Chiou and Tucker (https://www.nber.org/papers/w26982) that found that the combination of high speed internet access and high income was the key driver of ability to stay at home. We do indeed find a correlation between increase in percentage of devices staying completely at home and percentage of households with broadband such as cable, fiber optic or DSL (coefficient 0.36, p value < 2e-16, R-squared 0.22). However, high speed internet access does not provide any additional information to a model that already incorporates income; including high speed internet access in a regression with income raises the R-squared value by less than 0.01 relative to the R-squared of 0.37 for the model with income alone, and does not provide additional useful information in the multivariable regression model.

Similarly, though the percentage of households that have 1 or fewer occupants per room also correlates with change in devices staying at home (coefficient 0.37, p value < 2e-16, R-squared 0.18), this metric also fails to provide significant additional predictive power over income (R-squared 0.38 for income and occupants per room combined).

Percentage of residents speaking English well provided some, but less, predictive power on its own (coefficient 0.36, p value < 2e-16, R-squared 0.12), but again was not significant in multiple regression analyses that incorporated other demographic variables.

Percentage of residents speaking Spanish did offer some predictive power (coefficient of -0.24, p value < 2e-16, R-squared 0.25) but is a very similar metric to Hispanic/Latino population, and was similarly insignificant when combined with education and income.

Variables without strong correlation

Demographic variables we consider that lack strong correlations with changes in staying at home include percentage of residents ages 65 and older (R-squared 0.04), percentage of households with a vehicle available (R-squared 0.08), percentage of workers that are male (R-squared 0.0002), and, as mentioned previously, percentage of residents that are White (R-squared 0.005).

Discussion and final notes

All of the analyses considered here are correlatory, and thus should not be taken to indicate any kind of causal relationship. Our results on shelter-in-place behavior and Hispanic/Latino population exemplify how apparent correlations may disappear when the effects of other variables are taken into account. Thus, though our results suggest that block groups with more residents of higher incomes and higher education levels tend to be those with greater increases in staying at home, they do not explain why this relationship exists nor rule out the possibility of other underlying factors not considered in this analysis driving these trends. For example, block groups with more residents of higher income and education level may also be those with more residents who can work from home, while those with more residents of lower income and education level may be those with more residents who are essential workers. Further analysis is needed to better characterize the relationships that our results identify between demographic variables and social distancing compliance.

Our full analyses can be viewed here https://stanfordfuturebay.github.io/simone_sd_correlations_analysis_sj_01.html.

Contact Simone Speizer () with questions.